{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# 使用随机森林 用一年的过去天气数据来预测我们城市明天的最高温度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "features = pd.read_csv('../part03-neural-network/csv-data/temps.csv')\n",
    "features.head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "数据解释\n",
    " * year：2016年所有数据点\n",
    " * month：一年中的月份数\n",
    " * day：一年中的某一天的数字\n",
    " * week：星期几\n",
    " * temp_2：2天前的最高温度\n",
    " * temp_1：1天前的最高温度\n",
    " * average： 历史平均最高温度\n",
    " * actual：最高温度 真实值"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "print('The shape of our features is:', features.shape)\n",
    "features.describe()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 数据预处理\n",
    "1. 查看数据的维度，我们注意到只有348行，但其实2016年有366天。通过NOAA的数据，我注意到几个缺失的日子\n",
    "2. week 是字符，可以使用热编码，转成数字型"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "features = pd.get_dummies(features)\n",
    "print('The shape of our features is:', features.shape) # 数据形状现在是349 x 18\n",
    "features.iloc[:,5:].head(5)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 将数据转换为数组"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "labels = np.array(features['actual']) # Labels are the values we want to predict\n",
    "features= features.drop('actual', axis = 1) # Remove the labels from the features\n",
    "feature_list = list(features.columns) # Saving feature names\n",
    "features = np.array(features) # Convert to numpy array\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 将数据分成训练和测试集"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)\n",
    "print('Training Features Shape:', train_features.shape)\n",
    "print('Training Labels Shape:', train_labels.shape)\n",
    "print('Testing Features Shape:', test_features.shape)\n",
    "print('Testing Labels Shape:', test_labels.shape)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "训练和评估之前，需要建立一个基线作为明智的衡量标准"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# The baseline predictions are the historical averages\n",
    "baseline_preds = test_features[:, feature_list.index('average')]\n",
    "# Baseline errors, and display average baseline error\n",
    "baseline_errors = abs(baseline_preds - test_labels)\n",
    "print('Average baseline error: ', round(np.mean(baseline_errors), 2))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 使用Scikit-learn创建和训练模型\n",
    "\n",
    "1. 从skicit-learn导入随机森林回归模型\n",
    "2. 实例化模型，并在训练数据上拟合（scikit-learn的训练名称）模型\n",
    "3. 再次设置随机状态以获得可重现的结果"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestRegressor\n",
    "# Instantiate model with 1000 decision trees\n",
    "rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)\n",
    "rf.fit(train_features, train_labels);"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 对测试集进行预测"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "predictions = rf.predict(test_features)\n",
    "# Calculate the absolute errors\n",
    "errors = abs(predictions - test_labels)\n",
    "# Print out the mean absolute error (mae)\n",
    "print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')\n",
    "# 平均预估值为3.83度。 这比基线平均提高了1度。 虽然这看起来似乎并不重要，但它比基线好近25％，现实生活中，基线可能代表公司数百万美元。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 确定性能指标\n",
    "为了正确看待我们的预测，我们可以使用从100％中减去的平均百分比误差来计算准确度。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Calculate mean absolute percentage error (MAPE)\n",
    "mape = 100 * (errors / test_labels)\n",
    "# Calculate and display accuracy\n",
    "accuracy = 100 - np.mean(mape)\n",
    "print('Accuracy:', round(accuracy, 2), '%.') # Accuracy: 93.99 %."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## 解释模型和报告结果\n",
    "这个模型如何得出价值？ 有两种方法可以深入了解随机森林：\n",
    "  首先，我们可以查看森林中的单个树，\n",
    "  其次，我们可以查看解释变量的特征重要性。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 可视化单个决策树\n",
    "Skicit-learn中随机森林实施中最酷的部分之一是： 可以实际检查森林中的任何树木。\n",
    "\n",
    "我们将选择一棵树，并将整棵树保存为图像"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# Import tools needed for visualization\n",
    "from sklearn.tree import export_graphviz\n",
    "from IPython.display import Image\n",
    "import pydotplus\n",
    "# Pull out one tree from the forest\n",
    "tree = rf.estimators_[5]\n",
    "\n",
    "dot_data = tree.export_graphviz(clf, out_file=None,\n",
    "                         feature_names=iris.feature_names,\n",
    "                         class_names=iris.target_names,\n",
    "                         filled=True, rounded=True,\n",
    "                         special_characters=True)\n",
    "graph = pydotplus.graph_from_dot_data(dot_data)\n",
    "Image(graph.create_png())"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# 使用 pydot 可以\n",
    "import pydot\n",
    "# Export the image to a dot file\n",
    "export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)\n",
    "\n",
    "# Use dot file to create a graph\n",
    "(graph, ) = pydot.graph_from_dot_file('tree.dot')\n",
    "\n",
    "# Write graph to a png file\n",
    "graph.write_png('tree.png')\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}